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Abstract. In this paper we present a simple database definition language: 
that of categories and functors. A database schema is a small category and 
an instance is a set-valued functor on it. We show that morphisms of schemas 
induce three "data migration functors", which translate instances from one 
schema to the other in canonical ways. These functors parameterize projec- 
tions, unions, and joins over all tables simultaneously and can be used in 
place of conjunctive and disjunctive queries. We also show how to connect a 
database and a functional programming language by introducing a functorial 
connection between the schema and the category of types for that language. 
We begin the paper with a multitude of examples to motivate the definitions, 
and near the end we provide a dictionary whereby one can translate database 
concepts into category-theoretic concepts and vice-versa. 
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1. Introduction 

This paper has two main goals. The first goal is to present a straightforward 
category-theoretic model of databases under which every theorem about small cat- 
egories becomes a theorem about databases. To do so, we will present a category 
Sch of database schemas, which has three important features: 

• the category Sch is equivalent to Cat, the category of small categories, 

• the category Sch is a faithful model for real-life database schemas, and 

• the category Sch serves as a foundation upon which high-level database 
concepts rest easily and harmoniously. 

The second goal is to apply this category-theoretic formulation to provide new 
data migration functors, so that for any translation of schemas F: C — > T>, one can 
transport instances on the source schema C to instances on the target schema D and 
vice versa, with provable "round-trip" properties. For example, homomorphisms of 
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instances arc preserved under all migration functors. While these migration func- 
tors do not appear to have been discussed in database literature, their analogues 
are well-known in modern programming languages theory, e.g. the theory of de- 
pendent types ([Mar]), and polynomial data types ([Kan]). This is part of a deeper 
connection between database schemas and kinds (structured collections of types, 
see [SH], [Ham]) in programming languages. See also Section 3.6. 

An increasing number of researchers in an increasing variety of disciplines are 
finding that categories and functors offer high-quality models, which simplify and 
unify their respective fields. 1 The quality of a model should be judged by its 
efficiency as a proxy or interface — that is, by the ease with which an expert can work 
with only an understanding of the model, and in so doing successfully operate the 
thing itself. Our goal in this paper is to provide a high-quality model of databases. 

Other category-theoretic models of databases have been presented in the past 
([JoM],[BCW],[JRW],[IP],[PS],[DC],[Tui]). Almost all of them used the more ex- 
pressive notion of sketch where we have used categories. The additional expressivity 
came at a cost that can be cast in terms of our two goals for this paper. First, 
the previous models were more complex and this may have created a barrier to 
wide-spread understanding and adoption. Second, morphisms of sketches do not 
generally induce the sorts of data migration functors that morphisms of categories 
do. 

It is our hope that the present model is simple enough that anyone who has an 
elementary understanding of categories (i.e. who knows the definition of category, 
functor, and natural transformation) will, without too much difficulty, be able 
to understand the basic idea of our formulation: database schemas as categories, 
database instances as functors. Moreover, we will provide a dictionary (see Section 
3.7, Table 1) whereby the main results and definitions in this paper will simply 
correspond to results or definitions of standard category theory; this way, the reader 
can rely on tried and true sources to explain the more technical ideas presented 
here. Moreover, one may hope to leverage existing mathematical theory to their 
own database issues through this connection. 

Before outlining the plan of the paper in Section f .2 we will give a short introduc- 
tion to the fundamental idea on which the paper rests, and provide a corresponding 
"categorical normal form" for databases, in Section f .1. 

1.1. Categorical normal form. A database schema may contain hundreds of 
tables and foreign keys. Each foreign key links one table to another, and each 
sequence of foreign keys T\ — > — > ■ ■ ■ — > T n results in a function / from the set 
of records in T\ to the set of records in T n . It is common that two different foreign 
key paths, both connecting table T\ to table T„, may exist; and they may or may 
not define the same mapping on the level of records. For example, a "landlinc" 
phone is assigned a phone number whose area code corresponds to the region in 

1 Aside from mathematics, in which category-theoretic language and theorems are indispensable 
in modern algebra, geometry, and topology, category theory has been successful in: program- 
ming language theory [Mog],[Pic]; physics [BS],[DI]; materials science [SGWB],[GSB]; and biology 
[Ros],[EV]. 
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which the phone is located. Thus we have two paths P — > R: 

N c 

(1) ' 




and the data architect for this schema knows whether or not these two paths should 
always produce the same mapping. In (??), the two paths do produce the same 
mapping, but note that if landlincPhonc was replaced with mobilePhone, they 
would not produce the same mapping! Thus we have the notion of path equivalence 
relation, which provides a crucial constraint. Its enforcement is often left to the 
application layer, but it should actually be included as part of the schema ([JD]). 
Including path equivalence information in the database schema has three main 
advantages: 

• it permits the inclusion of "hot" query columns without redundancy, 

• it provides an important check for creating schema mappings, and 

• it promotes healthy schema evolution. 

A category (in the mathematical sense) is roughly a graph with one additional bit 
of expressive power: the ability to declare two paths equivalent. We now have the 
desired connection between database schemas and categories: Tables in a schema 
are specified by vertices (or as we have drawn them in Diagram (1), by boxes); 
columns are specified by arrows; and functional equivalence of foreign key paths are 
specified by (the category-theoretic notion of) path equivalence. The categorical 
definition of schema will be presented rigorously in Section 3. 

The above collection of ideas leads us to the following normal form for databases. 

Definition 1.1.1. A database is in categorical normal form if 

• every table t has a single primary key column ID t , chosen at the outset. 
The cells in this column are called the row-ids of t; 

• for every column c of a table i, there exists some target table t' such that 
the value in each cell of column c refers to some row-id of t' . We denote 
this relationship by 

c: t -> 

• in particular, if some column d of t consists of "pure data" (such as strings 
or integers), then its target table t' is simply a 1-column table (i.e. a 
controlled vocabulary, containing at least the active domain of column d) , 
and we still write d: t — > t'\ and 

• when there are two paths p, q through the database from table t to table u 
(denoted p: t — > u, q: t — » u) and it is known to the schema designer that 
p and q should correspond to the same mapping of row-ids, then this path 
equivalence must be declared as part of the schema. We denote this path 
equivalence by 

P-9- 
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1.2. Plan of the paper. We begin in Section 2 by giving a variety of examples, 
which illustrate the virtues of the category-theoretic approach. These include con- 
ceptual clarity, simplified data migration, updatable views, interoperability with 
RDF data, and a close connection between data and program. In Section 3 we 
will give the precise definitions of categorical schemas and translations, and show 
that the category thereof, denoted Sch, is equivalent to Cat, the category of small 
categories. In this section we will also define the category of database instances on 
a given schema. In Section 4 we will define the data migration functors associated 
to a translation and begin to wrap up our tour by returning to the examples from 
Section 2 for a more in-depth treatment. We finish this work in Section 5, where 
we discuss data types and filtering. 

1.3. Terminology and notation. One obstacle to writing this paper is a certain 
overlap in terminology between databases and categories: the word "object" is com- 
monly used in both contexts. While object databases are interesting and perhaps 
relevant to some of the ideas presented here, we will not discuss them at all, hence 
keeping the namespace clear for categorical terminology. Unless otherwise spec- 
ified, the word object will always be intended in the category-theoretic 
sense. 

Say we have maps i ^ 5 4 C; we may denote their composite A — > C in 
one of two ways, depending on what seems more readable. The first is called 
"diagrammatic order" and is written as /; g. The second is called "classical order" 
and is written as g o f. We may sometimes choose not to write a symbol between 
/ and g, and in that case we use diagrammatic order fg: A —} C. 

1.4. Acknowledgments. The present paper is a total revamping of [Sp3], which, 
in an attempt to accommodate two disjoint communities (mathematicians and 
database experts), ended up as a sprawling and somewhat incoherent document. 
My thanks go to the referees and Bob Harper for pointing me toward a stream- 
lined presentation. I would also like to thank Peter Gates, Dave Balaban, John 
Launchbury, and Greg Morrisett for many useful conversations. Special thanks 
also go to Scott Morrison for coding many of the ideas from this paper into 
working form, available online for demonstration or open-source participation at 
http: / /categoricaldata.net /. 

2. Virtues by Example 

In what follows we illustrate the merits of the category-theoretic approach by 
way of several examples. One thing to note is that each of these features flows 
naturally from our compact mathematical definitions of schemas and translations. 
These definitions will be given in Section 3. 

2.1. Conceptual clarity. In a categorical schemas (Definition 3.2.6), every table 

T c U 

is a vertex and every column is an arrow. An arrow • >• represents a column of 

table T, with target table U, i.e. a foreign key constraint declaring that each cell in 
column c refers to a row-id in table U. We draw a box around our system of vertices 
and arrows, and the result is a categorical representation of the schema 2 . 



A system of vertices and arrows of this sort is called a graph. A graph can be considered as a 
kind of category (a so-called free category) in which no path equivalences have been declared. 
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Example 2.1.1. The following picture represents a schema S with six tables, two of 
which arc multi-column "fact tables" and four of which are 1-column "leaf" tables. 



(2) 



C := 




The fact table Tl has three columns (pointing to SSN, First, Last), in addition 
to its ID column; the fact table T2 also has three non-ID columns (pointing to 
First, Last, Salary). The leaf tables, SSN, First, Last, and Salary have only ID 
columns, as is seen by the fact that no arrows emanate from them. 

As a set of tables, an instance on schema C may look something like this: 

(3) 



Tl 


ID 


SSN 


First 


Last 


Tl-001 


115-234 


Bob 


Smith 


Tl-002 


122-988 


Sue 


Smith 


Tl-003 


198-877 


Alice 


Jones 



T2 


ID 


First 


Last 


Salary 


T2-A101 


Alice 


Jones 


$100 


T2-A102 


Sam 


Miller 


$150 


T2-A104 


Sue 


Smith 


$300 


T2-A110 


Carl 


Pratt 


$200 



(4) 



SSN 



ID 



115-234 



118-334 



122-988 



198-877 



342-164 



First 



ID 



Adam 



Alice 



Bob 



Carl 



Sam 



Sue 



Last 




Salary 


ID 




ID 


Jones 




$100 


Miller 




$150 


Pratt 




$200 


Richards 




$250 


Smith 




$300 



The thing to recognize here is that each column header c of Tl (respectively, of 
T2) points to some target table, in such a way that every cell in column c refers to 
a row-id in that target table. 3 The leaf tables serve as controlled vocabularies for 
the fact tables. 

Notation 2.1.2. In Example 2.1.1, we wrote out all four leaf tables (Display 4). 
In the future we will not generally write out leaf tables for space reasons. In fact, 
any table that does not add explanatory power to a given example may be left out 
from our displays. 



^In the case that c is the ID column of Tl, the target table to which c points is Tl, and each cell 
in column c is a row-id in Tl that refers to itself. 
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Example 2.1.3. In this example we present a schema C that includes path equiva- 
lences, and hence takes advantage of the full expressivity of categories. We imagine 
a company with employees and departments; every employee is in a department, 
every employee has a manager employee, and every department has a secretary 
employee. Using path equivalences, we enforce the following facts: 

• the manager of an employee is in the same department as that employee, 
and 

• the secretary of a department is in that department. 

These facts are recorded as equations at the top of the following diagram: 



(5) 



C := 



Mgr isin ~ isin Seer isin ~ id Departmellt 



Mgr 



First 



Stringl 




Department 



Sccr 



Last. 



String2 



String3 



As a set of tables, an instance on C may look something like this: 



Employee 


ID 


First 


Last 


Mgr 


isin 


101 


David 


Hilbert 


103 


qlO 


102 


Bertrand 


Russell 


102 


x02 


103 


Alan 


Turing 


103 


qlO 



Department 


ID 


Name 


Seer 


qlO 


Sales 


101 


x02 


Production 


102 



It is no coincidence that there are a total of six non-ID columns in the two tables 
and a total of six arrows in the schema C. The equations can be checked on these 
cells; for example we can check the first equation on row-id 101: 

101. Manager, isin = 103.isln = qlO and 101. isin = qlO. 

An instance / on C is only valid if the two equations hold for every row in /. 

2.2. Simplified data migration. A translation F; C — > T> of schemas (Definition 
3.3.1) is a mapping that takes vertices in C to vertices in T> and arrows in C to 
paths in T>; in so doing, it must respect arrow sources, arrow targets, and path 
equivalences. We will be using the following translation F : C — > V throughout 
Section 2.2. 
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(6) 



C :-- 



SSN 




# 




/First 




/ • 




// \ 

Tl / 


x T2 


• 


• 


\ / 
Last / 




• / 




Salary 




• 






The mapping F is drawn as suggestively as possible. In the future, we will rely 
on this "power of suggestion" to indicate the translations, but this time we will 
be explicit. Each of the four leaf vertices, SSN, First, Last, and Salary in C is 
mapped to the vertex in T> of the same label. The two other vertices in C, namely 
Tl and T2, are mapped to vertex T in T>. Since translations must respect arrow 
sources and targets, there is no additional choice about where the arrows in C are 
sent; for example F sends the arrow Tl — > First in C to the arrow T —¥ First in 
V. 

We have now specified a translation F from schema C to schema £>, pictured in 
Diagram 6. Springing forth from this translation are three data migration functors, 
which we will discuss in turn in Examples 2.2.1, 2.2.3, and 2.2.5. They migrate 
instance data on T> to instance data on C and vice versa. Instances on C (respectively 
on T>) form a category, which we denote by C-Inst (respectively by X>-Inst); they 
are defined in Definition 3.5.1. We have the following chart, the jargon of which 
will be introduced shortly: 



Data migration functors induced by a translation F : C — > T> 


Name 


Symbol 


Long symbol 


Idea of definition 


Pullback 




A F : X>-Inst -)■ C-Inst 


Composition with F 


Right Pushforward 


n F 


n F : C- Inst V Inst 


Right adjoint to Af 


Left Pushforward 




S F : C-Inst P-Inst 


Left adjoint to Ap 



Everything in the chart will be defined in Section 4. For now we give three examples. 
In each, we will be starting with the translation F : C — > I?, given above in Diagram 
(6). 

Example 2.2.1 (Pullback). Let F : C — > T> be the translation given in Diagram (6). 
In this example, we explore the data migration functor Ap : P-Inst — > C-Inst 4 
by applying it to a £>-instance J. Note that even though our translation F points 
forwards (from C to T>), our migration functor Af points "backwards" (from 2?- 
instances to C-instances). We will see why it works that way, but first we bring the 
discussion down to earth by working with a particular P-instance. 



We have not defined Ap yet; this will be done in Section 4.1. 
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Consider the instance J, on schema T>, defined by the table 



(7) 



T 


ID 


SSN 


First 


Last 


Salary 


XF667 


115-234 


Bob 


Smith 


$250 


XF891 


122-988 


Sue 


Smith 


$300 


XF221 


198-877 


Alice 


Jones 


$100 



and having the four leaf tables from Example 2.1.1, Display (4). Pulling back along 
the translation F, we are supposed to get an instance Ap(J) on schema C, which 
we must describe. But the description is easy: Ap{J) splits up the columns of 
table T according to the translation F. The four leaf tables will be exactly the 
same as above (i.e. the same as in Example 2.1.1 (4)), and the two fact tables will 
be something like 5 

(8) 



Tl 


ID 


SSN 


First 


Last 


XF667T1 


115-234 


Bob 


Smith 


XF891T1 


122-988 


Sue 


Smith 


XF221T1 


198-877 


Alice 


Jones 



T2 


ID 


First 


Last 


Salary 


A21 


Alice 


Jones 


$100 


A67 


Bob 


Smith 


$250 


A91 


Sue 


Smith 


$300 



The fact that Tl and T2 are simply projections of T is a result of our choice of 
translation F. 

Remark 2.2.2. We have seen that the pullback functor Ap, which arises naturally 
for any translation F between schemas, automatically produces projections. 

In the next two examples, we will explore the right and left pushforward migra- 
tion functors induced by the translation F: C —> T> given in Diagram (6). These 
functors, denoted Wf and T,p , send C-instances to P-instances. Thus we start with 
the instance I (which was presented in Example 2.1.1) and explain its pushforwards 
IIi?(-0 and £f(-0 below in Examples 2.2.3 and 2.2.5, respectively. 

Example 2.2.3 (Right Pushforward). Let F: C — > T) be the translation given in 
Diagram (6) . In this example, we explore the data migration functor 11^ : C-Inst — > 
Z?-Inst 6 by applying it to the C-instance I shown in Displays (3) and (4). Note that 
our migration functor IIf points in the same direction as F: it takes C-instances 
to "D-instances. We now describe the P-instance IIf (I), which has four leaf tables 
n F (7)(SSN), etc., and one fact table U F (I)(T). 

The four leaf tables of Hf(I) will be as in Display (4). The fact table of II f (I) 
will be the join of Tl and T2: 



T 


ID 


SSN 


First 


Last 


Salary 


T1-002T2-A104 


122-988 


Sue 


Smith 


$300 


T1-003T2-A101 


198-877 


Alice 


Jones 


$100 



Remark 2.2 A. We have seen that the right pushforward functor IIf, which 
arises naturally for any translation F between schemas, automatically produces 
joins. 



^There may be choice in the naming convention for row-ids. 
^We have not defined IIf yet; this will be done in Section 4.2. 
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Example 2.2.5 (Left Pushforward) . Let F; C — > V be the translation given in 
Diagram (6). In this example, we explore the data migration functor Hp '■ C-Inst — ¥ 
D-Inst 7 by applying it to the C-instance I shown in Displays (3) and (4). Note that 
our migration functor Hp points in the same direction as F: it takes C-instances 
to T>- instances. We now describe the D-instance Hp(I), which has four leaf tables 
EfCO(SSN), etc., and one fact table (7)(T). 

Instead of being a join, as in the case of Hf(I ) above, the fact table T in instance 
Ef (J) will be the union of Tl and T2. One may wonder then how the category 
theoretic construction will deal with the fact that records in Tl do not have salary 
information and the records in T2 do not have SSN information. The answer is 
that the respective cells are Skolemized. In other words, the universal answer is to 
simply add a brand new "variable" wherever one is needed in and downstream of 
T. Thus in instance T,p(I), table T looks like this: 



T 


ID 


SSN 


First 


Last 


Salary 


Tl-001 


115-234 


Bob 


Smith 


Tl-001. Salary 


Tl-002 


122-988 


Sue 


Smith 


Tl-002. Salary 


Tl-003 


198-877 


Alice 


Jones 


Tl-003. Salary 


T2-A101 


T2-A101.SSN 


Alice 


Jones 


$100 


T2-A102 


T2-A102.SSN 


Sam 


Miller 


$150 


T2-A104 


T2-A104.SSN 


Sue 


Smith 


$300 


T2-A110 


T2-A110.SSN 


Carl 


Pratt 


$200 



The Skolem variables (such as Tl-001. Salary) can be equated with actual values 
later. They can also be equated with each other; for example we may know that 
Tl-001. Salary=T2-002. Salary, without knowing the value of these salaries. 

Remark 2.2.6. We have seen that the left pushforward functor Hp, which 
arises naturally for any translation F between schemas, automatically produces 
unions and automatically Skolemizes unknown values. 

2.3. Updatable views and linked multi-views. Suppose we have a translation 
F: C — > T>. In this case we can consider T> as a view on C and consider C a view 
on £>. Unlike the classical version of views, our definition allows for arbitrarily 
many foreign keys between view tables; indeed, both C and T> can be arbitrary 
schemas. Typical relational databases management systems such as SQL do not 
support "linked multi- views", i.e. multiple view tables with foreign keys between 
them. For our data migration functors Hf,Hf and A^, this is no problem. 

In fact, by the very nature of these three migration functors (i.e. by definition 
of the fact that they are functors), we have access to powerful theorems relating 
updates of C-instances to updates of X>-instances. For example, given an instance I 
on T> whose A^-view is the instance J — Ap(I) on C, and given an update J — > J' 
on C, there is a unique update I — > Hp J', of instances on T>. A similar result holds 
for T,p in place of Hp: these facts follow from the fact that II (respectively Hp) is 
"adjoint" to A^- See Section 4. 

The view update problem is often phrased as asking that "the round trips are 
equivalences," ([BCFGP]), which for us amounts to the composites ApHp and 
HfAf (respectively HfAf and AfHf) being isomorphisms. This will only happe 



We have not defined Sf yet; this will be done in Section 4.3. 



10 



DAVID I. SPIVAK 



in case F is an equivalence of categories. However, our data migration adjunctions 
provide view updates in more general circumstances, and these have provable formal 
properties (e.g. Y*p and Ap commute with inserts and II ^ and commute with 
deletes). But of course the best formal properties occur when F is an equivalence. 

The following example shows two things. First, it gives an example of a linked 
multi-view (foreign keys between views). Second, the translation F is an equivalence 
of categories (a fact which relies essentially on the fact that C has path equivalences 
declared), and so the data migration functors A, II, and E are also equivalences — 
they exhibit no information loss. 

Example 2.3.1. Consider the two schemas drawn here: 



(9) 





V 



The arrows i\2 and i^i, which are declared to be mutually inverse, ensure that 
the data which can be captured by schema C is equivalent to that which can be 
captured by schema T>. The translation F sends ii2 and 121 to the trivial path idi 
on T (see Definitions 3.2.1 and 3.3.1, and Example 3.3.3). 

If table T is as in Example 2.2.1 (7), then its pullback under A^ to an instance 
on C is similar to (8), but with additional columns ii2 and 121 (because our schema 
has additional arrows i\2 and «2i): 

(10) 



Tl 


ID 


SSN 


First 


Last 


il2 


XF667T1 


115-234 


Bob 


Smith 


A67 


XF891T1 


122-988 


Sue 


Smith 


A91 


XF221T1 


198-877 


Alice 


Jones 


A21 



T2 


ID 


First 


Last 


Salary 


121 


A21 


Alice 


Jones 


$100 


XF221T1 


A67 


Bob 


Smith 


$250 


XF667T1 


A91 


Sue 


Smith 


$300 


XF891T1 



The foreign key columns ii2 and 121 on the C-view keep track of the data necessary 
for successful round-tripping. An update to a ©-instance will yield a corresponding 
update to a C-instance and vice versa. The fact that F is an equivalence of categories 
implies that Si?,II^, and Ap are also equivalences of categories, and roundtrip 
isomorphisms will hold for all possible updates. 
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2.4. Interoperability with RDF data. The Resource Descriptive Framework 
(RDF) is the semantic web standard data format [KC]. The basic idea is to encode 
all facts in terms of basic 

(Subject Predicate Object) 
triples, such as (Bob hasMother Sue). There are papers devoted to understanding 
the transformation from relational databases to RDF triple stores, and vice versa 
([ASX],[KT]). In this section we will assume a basic familiarity with the jargon of 
that field, such as URI (uniform resource identifier) . 

Category-theoretically, the formulation of RDF triple stores is quite simple. 
Given a schema C, a triple store over C is a category S (representing the triples) 
and a functor ir: S — > C (representing their types). The objects in S are URIs; the 
arrows in S are triples 

Subject Predicate Object 
• > • 

Given an object c G C in the schema, the inverse-image 7r _1 (c) C S consists of all 
URIs of type c. Given an arrow /: c—¥d in C, the inverse image is the /-relation 
between 7r _1 (c) and 7r _1 (d). 8 

There is a basic category-theoretic operation that converts a relational data- 
base instance into an RDF triple store (and a straightforward inverse as well, con- 
verting an RDF triple store into a relational database instance). It is called the 
Grothendieck construction. Consider for example the instance I from Example 2.1.3 
Display (5): 



Employee 


ID 


First 


Last 


Mgr 


isln 


101 


David 


Hilbert 


103 


qlO 


102 


Bertrand 


Russell 


102 


x02 


103 


Alan 


Turing 


103 


qlO 



Department 


ID 


Name 


Seer 


qlO 


Sales 


101 


x02 


Production 


102 



This relation can be functional or inverse functional, as dictated by the RDF schema; the subject 
can be understood category-theoretically by so-called "lifting constraints" (see [Sp4]). 
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Taking the Grothendieck construction yields the following triple store S = Gr(I), 
where each arrow designates a RDF triple, as above: 



(11) 



S = 





Production / Name 



c = 



Mgr 




isin 



Department 



Sccr 



String3 



In Display (11), ten arrows have been left out of the picture of S, (e.g. the arrow 



102 
• - 



La^t — ^Russeii pictured) for readability reasons. The point is that the 



RDF triple store associated to instance / is nicely represented using the standard 
Grothendieck construction. 

2.5. Close connection between data and program. Currently, there is an 
"impedance mismatch" between databases and programming languages; their re- 
spective formulations and underlying models do not cohere as well as they should 
([CI]). Whereas the programming languages (PL) community has embraced cate- 
gory theory for the conceptual clarity and expressive power it brings, most database 
theorists tend to concentrate on practical considerations, such as speed, reliability, 
and scalability. The importance of databases in the modern world cannot be over- 
stated, and yet in order for databases to reach their full potential, better theoretical 
integration with applications must be developed. 

As stated in the Introduction (Section 1), the first goal of this paper is to present 
a straightforward model of databases under which every theorem about small cat- 
egories becomes a theorem about databases. Thus the favorite category of PL 
theorists, namely the category Type of types and terms (for some fixed A-calculus, 
see [Awo, Section 6.5]), is a kind of infinite database schema: its tables correspond 
to types and its foreign key columns correspond to terms. Of course, unlike real- 
world databases in which tables model real-world entities and their relationships 
(such as people and their heights), the schema Type models mathematical entities 
and their relationships (such as integers and their factorials). However, these ideas 
clearly live in the same platonic realm, so to speak, and this notion is expressed by 
saying that both database schemas and Type can be considered as categories and 
related by functors. 
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This leads to nice integration between data and program. For example many 
spreadsheet capabilities, such summing up the values in two columns to get the 
value in a third, can be included at the schema level. At this point in the paper, we 
do not yet have the necessary machinery to show exactly how that should work (see 
Section 5.1), but in the following diagram one can see the schematic presentation 
of the relevant subcategory of Type: 



outl 




outr 



The point is to simultaneously see two different things within this one diagram (like 
an optical illusion). The first thing to see in Diagram (12) is a database schema. 
In schema P, we have two tables: 



(Int.Int) 


ID 


outl 


outr 


+ 


POcO 











PlcO 


1 





1 


Plcl 


1 


1 


2 


POcl 





1 


1 


P2c0 


2 





2 


P2cl 


2 


1 


3 


P2c2 


2 


2 


4 


Plc2 


1 


2 


3 


P0c2 





2 


2 











Int 
ID" 
0~~ 



_3 

4 



The second thing to see in Diagram (12) is a subcategory P C Type, i.e. a close 
connection to standard PL theory. The same category P is viewed extensionally in 
the context of databases and intentionally in the context of programs. This dual 
citizenship of categories makes category theory a good candidate for solving the 
impedance mismatch between databases and programming languages. 

3. Definitions 

In this section, our main goal is to define a category of schemas and translations, 
and to show that it is equivalent to Cat, the category of small categories. Along 
the way we will define the category of instances on a given schema. Finally, we 
will give a dictionary that one can use to translate between database concepts (e.g. 
found in [EN]) and category-theoretic concepts (e.g. found in [Mac]). 

3.1. Some references. Throughout this section, we will assume the reader has 
familiarity with the fundamental notions of category theory: objects, morphisms, 
and commutative diagrams within a category; as well as categories, functors, and 
natural transformations. There are many good references on category theory, in- 
cluding [LS], [Sic], [Pie], [BW1], [Awo], and [Mac]; the first and second are suited 
for general audiences, the third and fourth are suited for computer scientists, and 
the fifth and sixth are suited for mathematicians (in each class the first reference 
is easier than the second). One may also see [SK] for a different perspective. 
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3.2. Graphs, Paths, Schemas, and Instances. A graph (sometimes called a 
directed multi-graph) is a collection of vertices and arrows, looking something like 
this: 



a 

a f b / ^ c 

• s- • • 




k 



This is one graph with two connected components; it has five vertices and six 
arrows. 

Definition 3.2.1. A graph G is a sequence G — (A,V, src,tgt), where A and V 
are sets (respectively called the set of arrows and the set of vertices of G), and 
src : A — > V and tgt : A — > V are functions (respectively called the source function 
and the target function for G). If a G A is an arrow with source src{a) = v and 
target tgt(a) — w, we draw it as 

a 

V > W. 

Definition 3.2.2. Let G = (A, V, src, tgt) be a graph. A path of length n in G, 

(n) 

denoted p G Path G is a head-to-tail sequence 

(14) p = (U > Vi > V 2 > ... > V n ) 

of arrows in G. In particular, Path^ = A and Path^' = V; we refer to the path 
of length on vertex v as the trivial path on v and denote it by id„ . We denote by 
Pathc the set of all paths on G, 

Path G := |J Path^. 

Every path p £ Pathc has a source vertex and a target vertex, and we may abuse 
notation and write src, tgt: Pathc — > V. If p is a path with src(p) = v and 
tgt(p) = w, we may denote it by p: v — > w. Given two vertices v,w G V, we write 
Pathg (v, w) to denote the set of all paths p: v — > w. 

There is a composition operation on paths. Given a path p: v — >• w and q : w — > x, 
we define the composition, denoted pq: v — > x in the obvious way. In particular, 
if p (resp. r) is the trivial path on vertex v (resp. vertex w) then for any path 
q: v — > w, we have pq = q (resp. qr = q). Thus, for clarity, we may always denote 
a path as beginning with a trivial path on its source vertex; e.g. the path p from 
Diagram (14) may be denoted p = id Vo aia2 ■ ■ ■ a n . 

Example 3.2.3. In Diagram (13), there are no paths from A to D, one path (/) 
from A to B, two paths (fg and fh) from A to C, and infinitely many paths 
{iP^jk)" 1 ■ ■■i^{jk)^ | n, Pl , qi , . . . ,p n ,q n G N}) from D to D. 

We now define the notion of categorical equivalence relation on the set of paths of 
a graph. Such an equivalence relation (in addition to being reflexive, symmetric, and 
transitive) has two sorts of additional properties: equivalent paths have the same 
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source and target, and the composition of equivalent paths with other equivalent 
paths must yield equivalent paths. Formally we have Definition 3.2.4. 

Definition 3.2.4. Let G = (A,V, src,tgt) be a graph. A categorical path equiva- 
lence relation (or CPER) on G is an equivalence relation ~ on Pathg that has the 
following properties: 

(1) If p ~ q then src(p) — src(q). 

(2) If p ~ q then tgt(p) = tgt(q). 

(3) Suppose p, q: b — >■ c are paths, and m : a — > b is an arrow. If p ~ q then 
mp ~ mq. 

(4) Suppose p,q: a —> b are paths, and n: 6 — > c is an arrow. If p ~ g then 

~ gn. 

Lemma 3.2.5. Suppose that G is a graph and ~ is a CPER on G. Suppose 
p ~ g: a — > 6 and r ~ s: 6 — > c. T/ien pr ~ gs. 

Proof. The picture to have in mind is this: 

• — *- • • • — s- • • — • • • — s- • 




• — s- • • ■ — s- • • — s- • ■ ■ — s- • 



Applying condition (3) from Definition 3.2.4 to each arrow in path p, it follows by 
induction that pr ~ ps. Applying condition (4) to each arrow in path s, it follows 
similarly that ps ~ qs. Because ~ is an equivalence relation, it follows that pr ~ gs. 

□ 

Definition 3.2.6. A categorical schema C consists of a pair C :— (G, ~) where G 
is a graph and ~ is a categorical path equivalence relation on G. We sometimes 
refer to a categorical schema as simply a schema. 

Example 3.2.7. Consider the schema, i.e. the graph together with the indicated 
equivalence 9 , pictured in the box below: 

f_9 - fh 




ii 



This schema models, for example, the phenomenon of sending an email to oneself. 
Suppose we populate B with emails, C with people, g and h with the sender and 
receiver fields, respectively. Then for fg to equal fh we must have that senders 
equal receivers on the image of /, and thus the subset of self-emails is a perfect fit 
for A. See example 3.2.9. 



More precisely, consider the graph with the categorical equivalence relation generated by the set 

{fg = fh}. 
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More on the subject of categorical schemas, including a picture of a schema and 
an associated set of tables, can be found in Example 2.1.3. 

In the following, we will define what it means to be an instance of a categorical 
schema C. We consider the case in which our instances are set-models of C, but the 
same idea works in much more generality (see Definition 3.5.1). 

Definition 3.2.8. Let C := (G, ~) be a categorical schema, where G = (A, V, src, tgt). 
An instance on C, denoted /, consists of the following 

(1) For every vertex v 6 V, a set I(v). 

(2) For every arrow a: v — )■ v' in A, a function 1(a) : I(v) I(v') 

(3) For every path equivalence p~ga guarantee that the equation I(p) = I(q) 
holds. 10 

Example 3.2.9. We now return to Example 3.2.7, and write down a sample instance 
/ for schema C — (G, ~). 



A 

ID II / 

SEml207 ' Eml207 
SEml210 "Eml210 
SEml211 ' Eml211 



For each vertex v in G, the set I(v) is given by the set of rows in the corresponding 
table (e.g. 1(A) = {SEml207, SEml210, SEml211}). For each arrow a: v ->• w 
in G the function 1(a) : I(v) — > I(w) is also evident as a column in the table. 
For example, 1(g) : 1(B) — > 1(C) sends Eml206 to Bob, etc. Finally, the path 
equivalence fg — fh is borne out in the fact that for every row-id in table A, 
following / then g returns the same result as following / then h. 

3.3. Translations. A translation is a mapping from one categorical schema to 
another. Vertices are sent to vertices, arrows are sent to paths, and all path equiv- 
alences are preserved. More precisely, we have the following definition. 

Definition 3.3.1. Let G = (A G ,V G , srcc, tgto) and H — (Ajj, Vu, svch, tgtn) be 
graphs (see Definition 3.2.1), and let C = (G, ~c) and V = (H,~x>) be categorical 
schemas. A translation F from C to T>, denoted F: C — > T> consists of the following 
constituents: 

(1) a function Vf ■ Vg Vh, and 

(2) a function Ap : Aq — > Pathp 

subject to the following conditions: 

^Once I is defined on arrows, as it is in item (2), we can extend it to paths in the obvious way: 
if p = a\a,2 ■■■<!„, then the function I(p) is the composition I(p) = I(a\)I(a,2) ■ ■ ■ I(a n ). 



B 


ID 


9 


h 


Eml206 


Bob 


Sue 


Eml207 


Carl 


Carl 


Eml208 


Sue 


Martha 


Eml209 


Chris 


Bob 


Eml210 


Chris 


Chris 


Eml211 


Julia 


Julia 


Eml212 


Martha 


Chris 



c 

ID 

Bob 

Carl 

Chris 

Julia 

Martha 

Sue 
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(a) the function Af preserves sources and targets; in other words, the following 
diagrams of sets commute: 

Aq — i- Path H Aq — -4- Path ff 



tgtc 



tgtn 



Vq V H V G V H 

u V f u V f 

(b) the function Af preserves path equivalences. 11 Precisely, suppose we are 
given lengths m, n £ N and paths p = id Vo fif2 ■ ■ ■ f m and q — id VQ g\g2 • ■ ■ g n 
in G. Let v' Q — Vp(vo) and for each i < m (resp. j < n), let /? = 
(resp. g'j = A F (gj)), and let p' = id,,/ ft ft ■ ■ ■ f' m (resp. q' = 
^v' 9'i92 ■■■9n)- Hp-cq then p' ~ v q' . 
Two translations F, F 1 : C — > T> are considered identical if they agree on vertices 
(i.e. Vf = Vf>) and if, for every arrow / in C, there is a path equivalence 

A F {f) ~v Api (/). 

In the following two examples we will reconsider translations discussed in Section 

2. 

Example 3.3.2. Recall the mapping F given in Diagram (6). The schemas C and T> 
are just graphs in the sense that there are no declared path equivalences in cither 
of them. The mapping F sends vertices in C to vertices in T> and arrows in C to 
arrows in T>. Since an arrow is a particular sort of path, and since there are no 
path equivalences to be preserved, F: C —> T> is indeed a translation. 

Example 3.3.3. Recall the mapping F given in Example 2.3.1, Diagram (9). In 
this setup, C has declared path equivalences and T> does not; however T> still has a 
categorical path equivalence relation ~x> on it, the minimal reflexive relation. The 
mapping F on vertices (Vp) is self-explanatory; the only arrows on which Ap is 
not self-explanatory are i\2 and 121, both of which are sent to the trivial path idi 
on vertex T. 

Because Vf(T1) = Vf(T2) = T, it is clear that Af preserves sources and tar- 
gets. The path equivalence 112*21 = idn and 121*12 = idi2 are preserved because 
^4^(^12) = Af(*2i) — id-Ei ancl the concatenation of a trivial path with any path p 
yields p. 

3.4. The equivalence Sch ~ Cat. We assume familiarity with categories and 
functors, and in particular the category Cat of small categories and functors (a list 
of references is given in Section 3.1). In this section we will define the category Sch 
and show it is equivalent to Cat. It is this result that justifies our advertisement 
in the introduction that "every theorem about small categories becomes a theorem 
about databases". 

Definition 3.4.1. Recall the notions of categorical schemas and translations from 
Definitions 3.2.G and 3.3.1. The category of categorical schemas, denoted Sch, 
is the category whose objects are categorical schemas and whose morphisms are 
translations. 



^This is easier to understand conceptually than to write down. 
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Construction 3.4.2 (From schema to category). We will define a functor L : Sch — > 
Cat. Let C — (G, ~c) be a categorical schema, where G = (A, V, src, tgt). Define C 
to be the free category with objects V generated by arrows A. Define L(C) <E Cat 
to be the category defined as the quotient of C by the equivalence relation ~c (see 
[Mac, Section 2.8]). This defines L on objects of Sch. 

Given a translation F: C — > V, there is an induced functor on free categories 
F' : C' V , sending each generator / e A to the morphism in V defined as the 
composite of the path Ap(f). The preservation of path equivalence ensures that 
F' descends to a functor L(F) : L(C) — > L(T>) on quotient categories. This defines 
L on morphisms in Sch. It is clear that L preserves composition, so it is a functor. 

Construction 3.4.3 (From category to schema). We will define a functor R: Cat — > 
Sch. Let C be a small category with object set Ob(C), morphism set Mor(C), source 
and target functions s,t: Mor(C) — > Ob(C), and composition law o: Mor(C) x Q b(c) 
Mor(C) -> Mor(C). Let R{C) = (G, ~) where G is the graph 

G = (Mor(C),Ob(C),s,t) 

and with ~ defined as follows: for all /, g £ Mor(C) with t(f) = s(g) we put 

(15) fg * (g o /). 

This defines R on objects of Cat. 

A functor F : C — > T> induces a translation R(F) : R{C) — > R(T>), because vertices 
are sent to vertices, arrows are sent to arrows, and path equivalence is preserved 
by (15) and the fact that F preserves the composition law. This defines R on 
morphisms in Cat. It is clear that R preserves compositions, so it is a functor. 

Theorem 3.4.4. The functors 

L : Sch ± Cat R 

are mutually inverse equivalences of categories. 

Sketch of proof . It is clear that there is a natural isomorphism e: idcat L o R; 
i.e. for any category C, there is an isomorphism C = L(R(C)). Thus the functor L 
is essentially surjective. We first show that L is fully faithful. 

Choose schemas X and Y, and suppose X = (Ax,Vx, srcx,tgtx)', we must 
show that the function L\ \ Homs c h(AT, Y) — > Homcat (LX, LY) is a bijection. It 
is clearly injective. To show that it is surjective, choose a functor G: LX — > LY; 
we will define a translation F: X — s- Y with L\{F) = G. Define F on vertices of X 
as G is defined on objects of LX. Define F on arrows of X via the function Ax — > 
Pathjf — > Mor(LX) — > Mor(LY"), and choose a representative for its equivalence 
class from Path^y (note that any two choices result in the same translation: see 
Definition 3.3.1). Two equivalent paths in X compose to the same element of 
Mor(LX), so F preserves path equivalence. This defines F, completing the proof 
that L an equivalence of categories. 

By similar reasoning one proves that R is fully faithful, and concludes that it is 
inverse to L. 

□ 
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3.5. The category of instances on a schema. Given Theorem 3.4.4, the com- 
pound notion of categorical schemas and translations is equivalent to that of cate- 
gories and functors. In the remainder of the paper, we elide the difference between 
Sch and Cat, using nomenclature from each interchangeably. 

One sees easily (c.f. Definition 3.2.8) that an instance J on a schema C is the 
same thing as a functor C —> Set, where Set is the category of sets. Thus we have 
a ready-made concept of morphisms between instances: natural transformations 
of functors. This is an established notion in database literature, often called a 
homomorphism of instances (see e.g. [DNR]). 

Definition 3.5.1. Let C be a schema and let /, J: C — > Set be instances on C. A 
morphism m from I to J, denoted m: I — > J, is simply a natural transformation 
between these functors. We define the category of instances on C, denoted C-Inst, 
to be the category of instances and morphisms, as above. 

More generally, let S denote any category; we define the category of S-valued in- 
stances on C, denoted C-Instg := § c , to be the category whose objects are functors 
C — > S and whose morphisms are natural transformations. We refer to S as the 
value category in this setup. 

Remark 3.5.2. It appears that programming language theorists do not include ho- 
momorphisms between instances in their conception of database instances as ele- 
ments of a type, preferring instead to work with just the set Ob(C-Inst) of instances 
on a schema C. Doing so makes it easier to define aggregate functions, such as sums 
and counts; see e.g. [IT]. 

In the rest of the paper, we will generally work with C-Inst, the category of 
Set-valued instances. However, most of the results go through more generally for 
C-Instg, provided that S is complete and cocomplete (i.e. has all small limits and 
all small colimits). Obviously, given a functor S — > §' there is an induced functor 
C-Instg — )■ C-Instg' , so the choice of value-category can be changed without much 
cost. 

Example 3.5.3. Given a schema C, there are many categories 8, other than § = Set, 
for which one might be interested in C-Inst§. For example, given a lambda calculus, 
the associated category § = Type of types and terms is a good choice ([Awo, 
Section 6.5]). One can also use the category Fin of finite sets, cpo of complete 
partial orders, Cat of small categories, or Top of topological spaces. 

The choice of value-category is based on how one chooses to view the collection of 
rows in each table. We usually consider this collection to be a set, but for example 
one can imagine instead a topological space of rows, and in this case each column 
would consist of a continuous map from one space to another. 

Proposition 3.5.4. If S — Set then for any schema C G Sch, the category 
C-Inst = C-Instset is a topos. If C is a finite category then for any topos S 
(e.g. S = Fin, the category of finite sets), the category C-Instg is a topos. IfS is 
complete (resp. cocomplete) then C-Inst§ is also complete (resp. cocomplete). 

Proof. The first pair of claims are [JoP, A. 2. 1.3]. The second pair of claims are 
found in [Borl, Theorem 2.15.2] and [Awo, Proposition 8.8], respectively. 

□ 

Given a database instance /, updates on / include deletion of rows, insertion 
of rows, splitting (one row becoming two), and merging (two rows becoming one). 
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In fact, we classify insertions and merges together as progressive updates and we 
classify deletions and splits together as regressive updates. Then every update can 
be considered as a regressive update followed by a progressive update. 

Definition 3.5.5. Let C be a schema and / £ C-Inst an instance. A progressive 
update on I consists of an instance J and a natural transformation p: I — > J. 
A regressive update on I consists of an instance J and a natural transformation 
r: J — >• I. That is, a regressive update is just a progressive update in reverse. An 
update is a finite sequence of progressive and regressive updates. 

Proposition 3.5.6. Let C be a schema and I an instance on C. Any update on I 
can be obtained as a single regressive update followed by a single progressive update. 

Proof. The composition of two progressive (resp. regressive) updates is a progres- 
sive (resp. regressive) update. Hence any update on Iq := I can be written as a 
diagram D in C-Inst: 




Iq I\ h • ■ ■ In 

But the limit of this diagram (which can be taken, if one wishes, by taking fiber 
products such as Ij-i.j — > Ij Ijj+i, and repeating " ~" times), is what we need: 

I a <r- lim-D -4 /„. 

□ 

Remark 3.5.7. Another way to understand deletes is via filtering — one filters out 
all rows of a certain form. Filtering will be discussed in Section 5.3. 

3.6. Grothendieck construction. In Section 2.4 we showed how to convert an 
instance /: C — > Set to a new category Gr(I), called the Grothendieck construction 
or category of elements of /. This construction models the conversion from relational 
to RDF forms of data. There is a reverse construction that is described in [Sp4, 
Proposition 2.3.9]. 

We note here that there is a more general Grothendieck construction that may be 
useful in the context of federated databases. In programming languages theory, one 
may hear of a category of kinds, each object of which is itself a category of types. 
Here, each kind is analogous to a schema, and each type in that kind is analogous 
to a table in that schema. Given a category of kinds, we can "throw all their types 
together" by applying the generalized Grothendieck construction. This is akin to 
taking a federated database (i.e. a schema of related schemas) and merging them 
all into a single grand schema. One can apply this construction at the data level 
as well, merging all the instances into an instance over the single grand schema. 

The version of the Grothendieck construction given in Section 2.4 is for functors 
/: C — >• Set. Each object c £ Ob(C) corresponds to a table whose set of rows is 1(c). 
One can find a description of this construction in [MM, Section 1.5]. The version of 
the Grothendieck construction in which federated schemas are combined into one 
big schema is for functors D: C — > Cat. Each object c £ Ob(C) corresponds to 
a database whose category of tables is D(c). One can find a description of this 
construction in [JoP, B. 1.3.1]. 
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Table 1 . Dictionary between database terminology and category 
theory terminology. 



Dictionary between DB and CT terminology 


Database concept 


Category-theory concept 


Database schema, C 


Category, C 


Table T € C 


Object T E Ob(C) 


Column / of T 


Outgoing morphism, / : T — > ? 


Foreign key column / of T pointing to U 


Morphism / : T -> U 


Sequence of foreign keys 


Composition of morphisms 


Primary key column ID of T 


Identity arrow, idr '■ T — > T 


Controlled vocabulary (i.e. one-column 
table D) 


Object D without outgoing mor- 
phisms (except i&D : D — > D) 


Foreign key path equivalence in C 


Commutative diagram in C 


Instance I on schema C 


Functor I: C ->• Set (or I: C ->• S 
for some other "nice" category S) 


Conversion of Relational to RDF 


Grothendieck construction 


Insertion update u : Iq — > I\ 


Natural transformation u: Iq — > I\ 


Deletion update u: Iq — > h 


Natural transformation u : Ii — > Jo 


Schema mapping C — > T> 


Functor F : V -)■ C 


Basic ETL process C-Inst — > 2?-Inst 


Pullback functor, often denoted 
A F or F* : C-Set -)■ D-Set 



3.7. Dictionary. Our hope is that this paper will serve as a dictionary, whereby 
results from category theory literature can be imported directly into database the- 
ory. In Section 4 we will see such a result: translations between schemas provide 
data migration functors that have useful and provable properties. In Table 1 we 
gather some of the foundational links between databases and categories, as pre- 
sented throughout the paper. 

4. Data migration functors 

Given schemas C and T>, a data migration functor is tasked with transforming 
any C-instance I into some T>- instance J (or vice versa). Moreover, it must do so 
in a natural way, meaning that progressive (resp. regressive) updates on I must 
result in progressive (resp. regressive) updates on J. Data migration functors were 
concretely exemplified in Section 2.2. 

In this section we will start with a translation between schemas F : C — > T>. 
Recall from Definition 3.3.1 that this is simply a mapping from vertices in C to 
vertices in T> and arrows in C to paths in T>, respecting path equivalence. Any 
translation F generates three data migration functors. These will be denoted 

S F : C-Inst P-Inst A F : P-Inst -)■ C-Inst n F : C-Inst P-Inst. 

One may notice that Ap seems to go backwards — the direction opposite to that 
of F. Although this may seem counter-intuitive, in fact Ap is the simplest of the 
three data migration functors and the most straightforward to describe. 

Before we do so, let us quickly discuss value categories. Recall from Definition 
3.5.1 that for any category S, we have a category C-Instg of C-instances valued in 
S, i.e. the category of functors C — > S. The migration functor Ap: 2?-Inst§ — >• 
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C-Insts exists regardless of ones choice of S. For to exist, S must be cocomplete, 
and for lis to exist, § must be complete. To fix ideas, most readers should simply 
take S = Set, unless they are compelled to do otherwise. This is the case where 
the rows of each table form a set. 

4.1. The pullback data migration functor A. Suppose we have a translation 
F: C — > T>. Given a P-instance, / £ 23-Instg, we need to transform it to a C- 
instance in a natural way. But this is simple, because /: T> — > 8 is a functor and 
the composition of functors is a functor, so the composite 

C A V 4 S, 

is an object of C-Insts, as desired. Similarly, a natural transformation to: / — > J 
is whiskered with F to yield a natural transformation (mo F): (I o F) — > (Jo F). 
Thus we have defined a functor 

A F : P-Insts -»■ C-Inst s , A F (-) := (- o F) 

The slogan is 11 Ap is given by composition with F." 

We have now defined the pullback functor Ap : D-Insts —5- C-Insts . It was 
explicitly discussed in Example 2.2.1. Roughly, it can accommodate: renaming ta- 
bles, renaming columns, deleting tables, projecting out columns, duplicating tables, 
and duplicating columns. 

4.2. The right pushforward data migration functor II. Suppose we have a 
translation F: C — > V. Given a C-instance, / 6 C-Insts, we need to transform it 
to a P-instance in a natural way. We will do so by using the right adjoint of A F ; 
however, to do this we will need to assume that S is complete (i.e. that S has small 
limits). Note that § = Set is complete. 

Proposition 4.2.1. Let F: C — > T> be a functor, and let S be a complete category. 
Then the functor Af : P-Instg — > C-Insts has a right adjoint, which we denote by 

n F : C-Instg -)■ P-Insts. 

Proof. This is [Mac, Corollary X. 3. 2]. 

□ 

We have now defined the right pushforward functor 11^ : C-Insts — > P-Instg. It 
was explicitly discussed in Example 2.2.3. Roughly, it can accommodate: renaming 
tables, renaming columns, and joining tables. To see this, one applies the "pointwisc 
formula" for right Kan extensions, e.g. as given in [Mac, Theorem X.3.1]; a more 
explicit formulation is given in [Sp3]. 

4.3. The left-pushforward data migration functor S. Suppose we have a 
translation F : C — > T>. Given a C-instance, I € C-Insts, we need to transform it 
to a 2?-instance in a natural way. We will do so by using the left adjoint of Af; 
however, to do this we will need to assume that S is cocomplete (i.e. that S has 
small colimits). Note that S = Set is cocomplete. 

Proposition 4.3.1. Let F: C T> be a functor, and let S be a cocomplete category. 
Then the functor Ap - X>-Insts — > C-Insts has a left adjoint, which we denote by 

S F : C-Insts -> D-Inst s . 
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Proof. This follows from [Borl, Theorem 3.7.2] and [BW2, Proposition 9.11]. 



We have now defined the left pushforward functor S p : C-Instg — > P— Instg . It 
was explicitly discussed in Example 2.2.5. Roughly, it can accommodate: renaming 
tables, renaming columns, taking the union of tables, and creating Skolem variables. 
To see this, one applies the "pointwise formula" for left Kan extensions, e.g. as given 
in [Borl, Theorem 3.7.2]; a more explicit formulation is given in [Sp3]. 



In this section we will formulate the typing relationship that holds between ab- 
stract data and its representation. Until now, we have been considering data as 
simply a collection of interconnected elements — an instance in the sense of Defini- 
tion 3.2.8 keeps track of various sets of abstract elements, segregated into tables 
and connected together in precise ways. However, in reality, each such element is 
represented in its table by way of a datatype, such as strings or integers. Seman- 
tically, each cell in a given column c: t — > t' of a table t should have the same 
datatype, namely they should all have the datatype of the target table, t'. 

However, datatypes not only give a uniform method for displaying each data 
element, but they can also carry a notion of value. For example, salaries are numbers 
that can be added together to give meaningful invariants. It is for this reason 
that we must find a connection between database formalism and programming 
language formalism, as was discussed in Section 2.5. Category theory provides such 
a connection: both database schemas and programming languages form categories, 
and these categories can be related by functors. 

Below we will explain these concepts, in particular how to attach datatypes from 
a programming language to tables in a categorical schema. To do so, we will make 
use of the concepts in Section 4. After defining type signatures on schemas we will 
proceed to define morphisms of type signatures, which will enable us to filter data. 
For example, to filter all names that start with the letter R, we might pull back 
along an inclusion {R} — > Str, where Str is the set of strings. 

5.1. Assigning data types via natural transformation. We begin with two 
examples to motivate the definition. 

Example 5.1.1. What is meant mathematically by the phrase "a set of integers"? 
Consider that it is a set labeled X, together with a function f:X—*Z. Allowing 
our set of integers to change, we get different sets labeled X and different functions 
labeled /, but the set Z of integers is unchanged. From the categorical perspective 
we can understand "a set X of integers" in a couple different ways: 
(1) as a database instance /: C — > Set on the schema 



□ 



5. Data types and filtering 



C := 



x / z 



• s-» 



such that the image on • is fixed as /(•) = Z; or 

(2) as a database instance J: V — > Set on the schema 



V : 



x 
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equipped with a natural transformation / : J 
hand for the functor V — > Set given by P(») = 



-> {Z} (where {Z} is short- 

:Z). 



Example 5.1.2. Suppose we have a database and that we would like to enforce a 
mathematical relationship between two columns, say t and d, in a certain table X. 
For example, it might be that column t, say "time spent" (in an integer number of 
hours) is related to column d, say "debt owed" (as a dollar-figure) by a mathematical 
function d = r(t), say 

d(x) = r(t(x)) := $50*%) 
for each x 6 X. Just as in Example 5.1.1, this situation could be categorically 
represented in a couple ways, but we will focus on only one. Namely, we understand 
it as the collection of database instances J : V — > Set on schema T> as exemplified 
below 



(16) 



V := 



d = tr 




X t Y 


• s- a 


t 




r 


d \ ' 




^ z 


• 



J(X) := 



X 


ID 


t 


d 


CtrX13 


4 


$200 


CtrX14 


7 


$350 


CtrX15 


2 


$100 



The fact that the d has dollar-figure datatype and that d = $50 * t are enforced by 
a certain natural transformation, / : J — >• P, where P is a typing instance. We will 
give more details in Example 5.1.7. 

Definition 5.1.3. Let C be a schema and let P G Ob(C-Inst) be an instance. The 
category of P-typed instances on C, denoted C-Inst/p, is defined to be the "slice" 
category of instances over P (see, e.g. [MM, Categorical Preliminaries]). In other 
words, a P-typed instance on C is a pair (I, r) where / is an instance and r : I — >• P 
is a natural transformation; and a morphism of P-typed instances is a commutative 
triangle. 

Remark 5.1.4. Given a schema C we may refer to any instance P S Ob(C-Inst) 
as a typing instance if our plan is to consider P-typed instances, i.e. the category 
C-Inst/p. 

Remark 5.1.5. Fix a schema C and a category S and let £ :— C-Inst§ denote the 
category of S-valued instances on C. In practice £ is often a topos (see Proposition 
3.5.4). In case it is, then for any instance P G Ob(£), the category £/p of P-typed 
S-valued instances on C is again a topos (see [MM, Theorem IV. 7.1]). 

Construction 5.1.6. Suppose given a category Type of types for some program- 
ming language and an S-valued functor V : Type — > S which sends each type to its 
set (or S-object) of values. We often wish to use a fragment of Type to add typing 
information to our database schema C. If the fragment is given by the functor 

B Type and B is associated to the schema via a functor B —> C, 



S 



Type^ 



B 

then P :— HgoAjr(V) is the implied typing instance. We call the sequence (B, F, G) 
the typing auxiliary in this setup. 
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Example 5.1.7. We return to Example 5.1.2 with the language from Construction 
5.1.6. We will now describe a typing auxiliary. Let B be one-arrow category drawn 
below, and let G : B — )■ V be the suggested functor 



z' 



d = tr 




x t Y 


• *- < 


» 




r 


d\ ■ 




^ z 


• 



Consider also the functor F: B — > Type sending Y' to Int, the type of integers, 
Z' to Dollar the type of dollar figures, and r' to the function that multiplies an 
integer by 50. 

With V: Type — > § as in Construction 5.1.6, the implied typing instance P := 
n G o A F (V): V -> S has 

P(X) = Int x Dol; P(Y) = Int; P(Z) = Dollar, 

and P(r) : P(Y) — > P(Z) is indeed the multiplication by 50 map. 

Now a P-typed instance r: J — > P is exactly what we want. For each of X, Y, 
Z it is a set with a map to the given data type, and the naturality of r ensures the 
properties described in Example 5.1.2 (i.e. that d = $50 * t in the table J(X) from 
Display (16)). In other words, it ensures that for any row in J(X), the value of the 
cell in column d will be 50 times the value of the cell in column t. 

5.2. Morphisms of type signatures. Each data migration functor discussed in 
Section 4 is a kind of tool for schema evolution. As new tables and columns are 
created and others are discarded, the translation between old schema and new will 
induce data migration functors that convert seamlessly from old data to new (and 
vice versa) and from queries against the old schema to queries against the new one 
(see also [Sp4]). 

There is a slightly different kind of schema evolution that comes up often, namely 
changing data types. For example, if a company surpasses around 32,000 employees, 
they may need to change the datatype on their Employee table from a smallint to a 
bigint. More complex changes include cutting the price for every share of stock by 
half, or concatenating a first and a last name pair to form a new field. Importantly, 
one needs to be able to reason about how queries against today's schema will 
differ from those against yesterday's. Change-of-types functors are just like data 
migration functors, and the formal nature of their description allows one to reason 
about their behavior. 

Let C be a schema and let £ be the topos C-Inst. Given a morphism of typing 
instances k: P — )■ Q, there are induced adjunctions 

A fc n fc 

In other words, k induces an essential geometric morphism of toposes ([MM, The- 
orem IV.7.2]). 
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Definition 5.2.1. Let C be a schema and k: P — » Q a morphism of typing in- 
stances. We refer to the induced functors 

Sfe, Ilk ■ C-Inst /p —> C-Inst /q : C-Inst iq — > C Inst ip 

as type-change junctors. To be more specific, will be called the left pushforward 
type- change functor, Ilk will be called the right pushforward type- change functor, 
and Afc will be called the pullback type-change functor. 

Example 5.2.2. We return to Example 5.1.2 with V and P as defined there. Consider 
the left pushforward type-change functor in the case that k: P —> Q sends a 
dollar figure x to True if x > $200 and False if x < $200. The functor converts 
the table on the left to the table on the right below: 



X 




X 


ID 


t 


d 




ID 


t 


d 


CtrX13 


4 


$200 


Efc(r) = 


CtrX13 


4 


True 


CtrX14 


7 


$350 




CtrX14 


7 


True 


CtrX15 


2 


$100 




CtrX15 


2 


False 



The other type-change functors, Hk and not have useful results in the context 
of this particular example, but see Examples 5.2.3 and 5.3.1. 

The right type-change functor II handles what might be called "group satisfac- 
tion." Suppose we have a bunch (P) of people and a set (J) of items are distributed 
among them (r : I — > P). If the people are then subdivided into groups (fc : P — > Q), 
then we can ask each group q G Q, "how many ways are there for each of your 
members to offer up one of their items?" (cardinality of il/ £ (r) _1 (q) C 11^(7)). For 
example, if one of the people was handed an empty set of items then his or her 
group will have no such joint offering (IIfe(T) _1 (g) = 0). We now explain this by 
example. 

Example 5.2.3. Let S be a complete category and write C-Inst instead of C-Instg. 
In this example we explain how, given a morphism of typing instances fc : P — > Q 
on a schema C, the type-change functor 11^ : C-Inst /p — )■ C-Inst /q operates on a 
P-typed instance to return a Q-typed instance by what we called group satisfaction 
above. Suppose we have C and B as drawn, 



with the functor G : B — > C given by L' H> L and the functor P' : B — > S given by 
P'(L') = {1,2, 3, 4}. Let Q' : B -> S be given by Q'(L') = {x, y} and let k' : P' -> Q' 
be the map sending 1, 2 x; 3,4h-j/. Finally, let 

P := II G (P'), Q = n G (Q'), and k := II G (fc') : P -> Q. 

We are ready to compute the right type-change functor along k on any P-typed 
instance I — » P; we just need to write down some such instance. So, if 7 € C-Inst 
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is the table on the left then Tlk(I) is the table in the middle: 



L 


ID 


f 


a 


1 


b 


2 


c 


1 


d 


3 


e 


2 


f 


4 


g 


2 



n fe (i) 



L 


ID 


f 


(a,b) 


X 


(a,e) 


X 


( a :g) 


X 


(c,b) 


X 


(c,e) 


X 


(c,g) 


X 


(d,f) 


y 



b 

e 



d f 



1 2 



y 



and a sketch of the reasoning is given on the right. 

5.3. Filtering data. In this section we show how to use pullback type-change 
functor A to filter data (e.g. answer queries like "return the set of employees whose 
salary is less than $100." 

In fact, given a morphism of instances k: P — > Q, the associated pullback type- 
change functor Afc can either filter or multiply data (or both) depending on the 
injectivity or surjectivity of k. In this short section we concentrate only on filtering, 
because it appears to be more useful in practice. We work entirely by example; the 
definition of Delta is given in Section 5.2 or in the literature ([MM, Section IV. 7]). 

Example 5.3.1. As advertised above, we show how to filter employees by their 
salaries, in particular showing only those with salaries less than $100. 

Suppose we have a schema C and two typing auxiliaries, (B, P, G) and (£>, P', G), 
shown below: 

C := 



Type 






Name 




• 


Employee 




• 






^ Salary 




• 



Here we want Q'(Salary) to be the dollar-figure data type, we want P'(Salary) to 
be the subtype given by requiring that a dollar figure x be less than $100, and we 
want k' : P' — > Q' to be the inclusion. These typing auxiliaries induce a morphism 
of typing instances 

k := n G (fc') : P ^ Q, where P := Hg(P') and Q := H G (Q')- 

Now suppose that / S C-Inst is the Q-typed C-instance shown to the left below. 
Then 11^(7) is the P-typed instance to the right below. 



/ := 



Employee 










ID 


Name 


Salary 




Employee 


EmlOl 


Smith 


$65 




ID 


Name 


Salary 


Eml02 


Juarez 


$120 


n fc (/) = 


EmlOl 


Smith 


$65 


Eml03 


Jones 


$105 




Eml04 


Lee 


$90 


Eml04 


Lee 


$90 




Eml05 


Carlsson 


$80 


Eml05 


Carlsson 


$80 
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Thus we see that filtering is simply an application of the same data migration 
functor story. 

5.4. A normal form for data migration. We have seen several different forms 
of data migration functors throughout this paper. One may perform a sequence 
of data migration functors, e.g. moving data from one schema to another, then 
changing the data types, and finally filtering the result. Any such combination 
of data migration functors can be rewritten as a sequence of three: a pullback, 
a left pushforward, and a right pushforward. This is proven in [GK, Proposition 
1.1.2]. To us, it means that there is a normal form for a quite general class of 
queries that includes any combination of projections, duplications, unions, joins, 
group satisfactions, filtering, and changing data types. Any combination of such 
queries can be written in the form AfTIg^h for some F, G, H. This form may not 
be optimal in terms of speed, but it can serve as a single input format for query 
optimizers. 
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